Abstract - We address the problem of opportunistic multiuser
scheduling in downlink networks with Markov-modeled outage 
channels. We consider the scenario in which the scheduler does 
not have full knowledge of the channel state information, but 
instead estimates the channel state information by exploiting the 
memory inherent in the Markov channels along with ARQ-styled 
feedback from the scheduled users. Opportunistic scheduling is 
optimized in two stages: (1) Channel estimation and rate adap- 
tation to maximize the expected immediate rate of the scheduled 
user; (2) User scheduling, based on the optimized immediate 
rate, to maximize the overall long term sum-throughput of 
the downlink. The scheduling problem is a partially observ- 
able Markov decision process with the classic 'exploitation vs 
exploration' trade-off that is difficult to quantify. We therefore 
study the problem in the framework of Restless Multi-armed 
Bandit Processes (RMBP) and perform a Whittle's indexability 
analysis. Whittle's indexability is traditionally known to be hard 
to establish and the index policy derived based on Whittle's 
indexability is known to have optimality properties in various 
settings. We show that the problem of downlink scheduling under 
imperfect channel state information is Whittle indexable and 
derive the Whittle's index policy in closed form. Via extensive 
numerical experiments, we show that the index policy has near- 
optimal performance. 

Our work reveals that, under incomplete channel state infor- 
mation, exploiting channel memory for opportunistic scheduling 
can result in significant performance gains and that almost all 
of these gains can be realized using an easy-to-implement index 
policy. 



I. Introduction 

The wireless channel is inherently time-varying and stochas- 
tic. It can be exploited for dynamically allocating resources 
to the network users, leading to the classic opportunistic 
scheduling principle (e.g., HI). Understandably, the success 
of opportunistic scheduling depends heavily on reliable knowl- 
edge of the instantaneous channel state information (CSI) at 
the scheduler. Many sophisticated scheduling strategies have 
been developed with provably optimal characteristics (e.g., JU- 
JU) by assuming perfect CSI to be readily available, free of 
cost at the scheduler. 

In realistic scenarios, however, perfect CSI is rarely, if 
ever, available and never cost-free, i.e., a non-trivial amount 
of network resource, that could otherwise be used for data 
transmission, must be spent in estimating the CSI [2]. This
calls for jointly designing channel estimation and opportunistic 
scheduling strategies - an area that has recently received 
attention when the channel state is modeled by i.i.d. processes 
across time (e.g., Q, 0). The i.i.d. model has traditionally 
been a popular choice for researchers to abstract the fading 
channels, because of its simplicity and associated ease of 
analysis. On the other hand, this model fails to capture an
important characteristic of the fading channels: the
time-correlation, or channel memory (ff).

In the presence of estimation cost, memory in the fading 
channels is an important resource that can be intelligently 
exploited for more efficient, joint estimation and scheduling 
strategies. In this context, Markov channel models have been 
gaining popularity as realistic abstractions of fading channels 
with memory (e.g., ll9l- lfTTI '). 

In this paper, we study joint channel estimation and schedul- 
ing using channel memory, in downlink networks. We model 
the downlink fading channels as two-state Markov Chains 
with non-zero achievable rate in both states. The scheduling 
decision at any time instant is associated with two potentially 
contradicting objectives: (1) Immediate gains in throughput via 
data transmission to the scheduled user; (2) Exploration of the 
channel of a downlink user for more informed decisions and 
associated throughput gains in the future. This is the classic 
'exploitation vs exploration' trade-off often seen in sequential 
decision making problems (e.g., [12], [10]). We model the joint
estimation and scheduling problem as a Partially Observable 
Markov Decision Process (POMDP) and study the structure of 
the problem, by explicitly accounting for the estimation cost. 
Specifically, our contributions are as follows. 

• We recast the POMDP scheduling problem as a Restless
Multi-armed Bandit Process (RMBP) [14] and establish its
Whittle's indexability [14] in Sections IV and V. Even though
Whittle's indexability is difficult to establish in general [13],
we have been able to show it in the context of our problem.

• Based on a Whittle's indexability condition, we explicitly
characterize the Whittle's index policy for the scheduling
problem in Section VI. Whittle's index policies are known
to have optimality properties in various RMBP processes and
have been shown to be easy to implement (e.g., [13], [16]).

• Using extensive numerical experiments, we demonstrate in
Section VII that Whittle's index policy in our setting has
near-optimal performance and that significant system-level 
gains can be realized by exploiting the channel memory for 
estimation and scheduling. Also, the Whittle's index policy we 
derive is of polynomial complexity in the number of downlink 
users (contrast this with the PSPACE-hard complexity of 
optimal POMDP solutions ifTTl ).

Fig. 1. A two-state Markov chain.

Our setup significantly differs from related works (e.g., iflOl 
HD Qll) in the following sense: In these works, the channels 
are modeled by ON-OFF Markov Chains with the OFF state 
corresponding to zero-achievable rate of transmission. There, 
once a user is scheduled, there is no need to estimate the 
channel of that user, since it is optimal to transmit at the 
constant rate allowed by the ON state irrespective of the 
underlying state. In contrast, in our model, the achievable rate 
at the lower state is, in general, non-zero and any rate above 
this achievable rate leads to outage. This extended model 
captures the realistic scenario when non-zero rates are possible 
with the use of sophisticated physical layer algorithms, even 
when the channel is bad. In this model, once a user is 
scheduled, the scheduler must estimate the channel of that user, 
with an associated cost, and adapt the transmission rate based 
on the estimate. The rate adaptation must balance between 
aggressive transmissions that lead to outage and conservative 
transmissions that lead to under-utilization of channels. The 
achievable rate expected from this process, in turn, influences 
the choice of the scheduled user. Thus the channel estimation 
and scheduling stages are tightly coupled, introducing several 
technical challenges to the problem, which we address in this 
paper. 

II. System Model and Problem Statement 

A. Channel Model 

We consider a downlink system with one base station (BS) 
and N users. Time is slotted with the time slots of all users 
synchronized. The channel between the BS and each user is 
modeled as a two-state Markov chain, i.e., the state of the 
channels remains static within each time slot and evolves 
across time slots according to Markov chain statistics. The 
Markov channels are assumed to be independent and, in 
general, non-identical across users. The state space of channel
$C_i$ between the BS and user $i$ is given by $S_i = \{l_i, h_i\}$.
Each state corresponds to a maximum allowable data rate.
Specifically, if the channel is in state $l_i$, there exists a rate $\delta_i$,
$0 \le \delta_i < 1$, such that data transmissions at rates below $\delta_i$
succeed and transmissions at rates above $\delta_i$ fail, i.e., outage
occurs. Similarly, state $h_i$ corresponds to data rate 1. Note that
fixing the higher rate to be 1 across all users does not impose 
any loss of generality in our analysis. This will be evident as 
we proceed. 

The Markovian channel model is illustrated in Fig. 1. For
user $i$, the two-state Markov channel is characterized by a $2 \times 2$
probability transition matrix (rows and columns ordered as $h_i$, $l_i$)

$$P_i = \begin{bmatrix} p_i & 1 - p_i \\ r_i & 1 - r_i \end{bmatrix},$$

where

$$p_i := \mathrm{prob}(C_i[t] = h_i \mid C_i[t-1] = h_i),$$
$$r_i := \mathrm{prob}(C_i[t] = h_i \mid C_i[t-1] = l_i).$$

Fig. 2. Opportunistic scheduling with estimation and rate adaptation.
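As a minimal sketch of the channel model above, the following Python snippet simulates one user's two-state Markov channel and evaluates the steady-state probability of the high state. The function names, the boolean encoding of the states (True for the high state), and the `seed` argument are illustrative assumptions and do not come from the paper.

```python
import random


def simulate_channel(p, r, steps, start_high=True, seed=0):
    """Simulate a two-state Markov channel for `steps` slots.

    p: probability of the high state given the previous state was high.
    r: probability of the high state given the previous state was low.
    Returns a list of booleans, True meaning the high state.
    """
    rng = random.Random(seed)
    state = start_high
    path = []
    for _ in range(steps):
        prob_high = p if state else r
        state = rng.random() < prob_high
        path.append(state)
    return path


def stationary_high_prob(p, r):
    """Steady-state probability of the high state.

    Solves pi = pi * p + (1 - pi) * r, which requires 1 - p + r > 0.
    """
    return r / (1.0 - p + r)


path = simulate_channel(p=0.8, r=0.2, steps=20000, seed=1)
empirical = sum(path) / len(path)
print(empirical, stationary_high_prob(0.8, 0.2))
```

For a long simulated path, the empirical fraction of high-state slots should be close to the steady-state value.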



B. Scheduling Model 

We adopt the one-hop interference model, i.e., in each time 
slot, only one user can be scheduled for data transmission. 
At the beginning of the slot, the scheduler does not have 
exact knowledge of the channel state of the downlink users. 
Instead, it maintains a belief value $\pi_i$ for channel $i$, which
is the probability that $C_i$ is in state $h_i$ at the time. We will
elaborate on the belief values soon. Using these belief values, 
the scheduler picks a user, estimates its current channel state 
and subsequently transmits data at a rate adapted to the channel 
state estimate - all with an objective to maximize the overall 
sum-throughput of the downlink system. Specifically, in each 
slot, the scheduler jointly makes the following decisions: (1) 
Considering each user, the scheduler decides on the optimal 
channel estimator (that could involve the expenditure of net- 
work resources such as time, power, etc.) and rate adapter 
pair; (2) Based on the average rate of successful transmission 
promised for each user by the previous decision, the scheduler 
picks a user for channel estimation and subsequent data 
transmission. At the end of the slot, consistent with recent 
models (e.g., IflOl IfTTl |[T8l ). the scheduled user sends back 
accurate information on the state of the Markov channel in that 
slot. This accurate feedback is, in turn, used by the scheduler 
to update its belief on the channels, based on the Markov 
channel statistics. Note that these belief values are sufficient 
statistics to the past scheduling decisions and feedback [19].
Using $\epsilon_\pi$ to denote an arbitrary estimator and $\eta_\pi$ to denote
an arbitrary rate adapter, as functions of the belief value $\pi$,
the basic operation is summarized in Fig. 2. The scheduling
problem can be formulated as a partially observable Markov
decision process [19], with the Markov channel states being
the partially observable system states.

As noted in Section I, the scheduling decision in each slot
involves two objectives: data transmission to the scheduled 
user and probing the channel of the scheduled user (through 
the accurate end-of-slot feedback). On one hand, the scheduler 
can transmit data to the user that promises the best achievable 
rate at the moment and hence realize immediate performance 
gains. On the other hand, the scheduler can schedule possibly
another user and use the channel feedback from that user to 
gain a better understanding of the overall downlink system, 
which, in turn, could result in more informed future scheduling 
decisions with corresponding performance gains. 

C. Formal Problem Statement 

We now proceed to formally introduce the expected immediate
reward. We let $\pi_i$ denote the current belief value of the
channel of user $i$, and let $u := \{\epsilon, \eta\}$ denote an arbitrary
estimator and rate adapter pair. Recall from the discussion
on the scheduling model that, at the end of the slot, the
scheduled user sends back accurate feedback on its Markov
channel state in that slot. With this setup, once a user is
scheduled, the choice of the channel estimator and rate adapter
pair does not affect the future paths of the scheduling process.
Thus, within each slot, it is optimal to design this pair to
maximize the expected rate (of successful transmission) of
the user scheduled in that slot. Henceforth, in the language of
POMDPs, we call this maximized rate the expected immediate
reward. The optimal estimator and rate adapter pair,
$u^*_{i,\pi_i} = \{\epsilon^*_{\pi_i}, \eta^*_{\pi_i}\}$, for user $i$, when the belief is $\pi_i$, is
given by

$$u^*_{i,\pi_i} = \arg\max_{u} E_{C_i}[\gamma_i(C_i, u)], \tag{1}$$

where the quantity $\gamma_i(C_i, u)$ is the average rate of successful
transmissions to user $i$ when the channel is in state $C_i$ and the
estimator and rate adapter pair $u$ is deployed. The expectation
in (1) is taken over the underlying channel state $C_i$, with
distribution characterized by belief value $\pi_i$, i.e.,

$$C_i = \begin{cases} h_i & \text{with probability } \pi_i, \\ l_i & \text{with probability } 1 - \pi_i. \end{cases}$$

The expected immediate reward when user $i$ is scheduled
is thus given by

$$R_i(\pi_i) = E_{C_i}[\gamma_i(C_i, u^*_{i,\pi_i})]. \tag{2}$$

Note that our model is very general in the sense that we
do not restrict to any specific estimation and data transmission
structure or to any specific class of estimators. A typical
estimation and data transmission structure, corresponding to the
estimator and rate adapter pair $u$, is illustrated in Fig. 3. Here,
a pilot-aided training-based estimation [2] is performed for a
fraction of the time slot, followed by data transmission at an
adapted rate in the rest of the time slot.

We now introduce the optimality equations for the scheduling
problem. Let $\vec{\pi}[t] = (\pi_1[t], \cdots, \pi_N[t])$ denote the vector
of current belief values of the channels at the beginning of slot
$t$. A stationary scheduling policy, $\Psi$, is a stationary mapping
$\Psi : \vec{\pi} \to I$ between the belief vector and the index of the
user scheduled for data transmission in the current slot. Our
performance metric is the infinite horizon, discounted
sum-throughput of the downlink (henceforth simply the expected
discounted reward in the language of POMDPs), formally
defined next.

For a stationary policy $\Psi$, the expected discounted reward
under initial belief $\vec{\pi}$ is given by

$$V(\Psi, \vec{\pi}) = \sum_{t=0}^{\infty} \beta^t E_{\vec{\pi}[t]}\big[R_{I[t]}(\pi_{I[t]}[t])\big], \quad I[t] = \Psi(\vec{\pi}[t]),$$

where $\vec{\pi}[t]$ is the belief vector in slot $t$, $\pi_i[t]$ denotes the belief
value of user $i$ in slot $t$, $\vec{\pi}[0] = \vec{\pi}$, and $I[t]$ denotes the index of
the user scheduled in slot $t$. The discount factor $\beta \in [0, 1)$
provides relative weighing between the immediately realizable
rates and future rates. For any initial belief $\vec{\pi}$, the optimal
expected discounted reward, $V(\vec{\pi}) = \max_{\Psi} V(\Psi, \vec{\pi})$, is given
by the Bellman equation [20]

$$V(\vec{\pi}) = \max_{I} \{ R_I(\pi_I) + \beta E_{\vec{\pi}^+}[V(\vec{\pi}^+)] \}.$$

Here $\vec{\pi}^+$ denotes the belief vector in the next slot when the
current belief is $\vec{\pi}$. The belief evolution $\vec{\pi} \to \vec{\pi}^+$ proceeds as
follows:

$$\pi_i^+ = \begin{cases} p_i & \text{if } I = i \text{ and } C_i = h_i, \\ r_i & \text{if } I = i \text{ and } C_i = l_i, \\ Q_i(\pi_i) & \text{if } I \ne i, \end{cases} \tag{3}$$

where $Q_i(x) = x p_i + (1 - x) r_i$ is the belief evolution operator
for user $i$ when it is not scheduled in the current slot. A
stationary scheduling policy $\Psi^*$ is optimal if and only if
$V(\Psi^*, \vec{\pi}) = V(\vec{\pi})$ for all $\vec{\pi}$ [20].
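The belief evolution in (3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the list-based representation of the per-user parameters are hypothetical, and the end-of-slot feedback is passed in as a boolean (True if the scheduled user's channel was in the high state).

```python
def q(belief, p, r):
    """Belief evolution for an unscheduled user: Q(x) = x * p + (1 - x) * r."""
    return belief * p + (1.0 - belief) * r


def update_beliefs(beliefs, scheduled, observed_high, p, r):
    """Return the next-slot belief vector following Eq. (3).

    beliefs, p, r: lists with one entry per user.
    scheduled: index of the user scheduled in the current slot.
    observed_high: accurate end-of-slot feedback from the scheduled user.
    """
    next_beliefs = []
    for i, belief in enumerate(beliefs):
        if i == scheduled:
            # Feedback reveals the state, so the belief resets to p_i or r_i.
            next_beliefs.append(p[i] if observed_high else r[i])
        else:
            # No feedback: propagate the belief one step through the chain.
            next_beliefs.append(q(belief, p[i], r[i]))
    return next_beliefs


print(update_beliefs([0.5, 0.5], 0, True, [0.9, 0.8], [0.1, 0.2]))
```

Applying `q` repeatedly to an unscheduled user's belief drives it toward the steady-state probability r / (1 - p + r), provided |p - r| < 1.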
In the introduction, we briefly contrasted our setup with 
those in HUliniimi. We provide a rigorous comparison here. 
The works iflOl ifTTI lfT8l studied opportunistic scheduling with 
the channels modeled by ON-OFF Markov chains. In these 
works, the lower state is an 'OFF' state, i.e., it does not allow 
transmission at any non-zero data rate. Contrast this with our 
model where, at the lower state $l_i$, a possibly non-zero rate
$\delta_i$ is achievable and outage occurs at any rate above $\delta_i$. We
now further explain how these two models are fundamentally 
different. 

• In the ON-OFF channel model, the scheduler does not need
a channel estimator and rate adapter pair. The scheduler can
aggressively transmit at rate 1, since it has nothing to gain by
transmitting at a lower rate - a direct consequence of the 'OFF'
nature of the lower state. On the other hand, transmitting at
a rate lower than 1 can lead to losses due to under-utilization
of the channel.

• In contrast, in our model, when $\delta > 0$, the scheduler must
strike a balance between aggressive and conservative rates of
transmission. An aggressive strategy (transmit at rate 1) can
lead to losses due to outages, while a conservative strategy
can lead to losses due to under-utilization of the channel. This
underscores the importance of the knowledge of the underlying
channel state and, therefore, the need for intelligent estimation
and rate adaptation mechanisms.

• As a direct consequence of the preceding arguments, the
expected immediate reward in our model is not a trivial $\delta$-shift
of the expected immediate reward when the rates supported by
the channel states are $0$ and $1 - \delta$. Formally,

$$R^{\{\delta,1\}}(\pi) \ne R^{\{0,1-\delta\}}(\pi) + \delta = (1-\delta)\pi + \delta,$$

where $R^{\{x,y\}}(\pi)$ is the immediate reward when the channel
state space is $\{x, y\}$ and the belief value of the scheduled user is
$\pi$. In fact, it can be shown (in Lemma 1) that

$$R^{\{\delta,1\}}(\pi) \le R^{\{0,1-\delta\}}(\pi) + \delta = (1-\delta)\pi + \delta.$$

We believe that our channel model, in contrast to the ON-OFF
model, better captures realistic communication channels
where, using appropriate physical layer algorithms, it is
possible to transmit at a non-zero rate even at the lowest state of
the channel model, and the same physical layer algorithms may
impose outage behavior when this allowed rate is exceeded.

III. Optimal Expected Transmission Rate - 
Structural Properties 

In this section, we study the structural properties of the
expected immediate reward, $R_i(\pi_i)$, defined in Equation (2).
These properties will be crucial for our analysis in subsequent
sections. For notational convenience, we will drop the suffix
$i$ in the rest of this section.

Lemma 1. The expected immediate reward $R(\pi)$ has the
following properties:

(a) $R(\pi)$ is convex and increasing in $\pi$ for $\pi \in [0, 1]$.

(b) $R(\pi)$ is bounded as follows:

$$\max\{\delta, \pi\} \le R(\pi) \le (1 - \delta)\pi + \delta. \tag{4}$$

Proof: Let $U^*$ be the set of optimal estimator and rate adapter
pairs for all $\pi \in [0, 1]$, i.e., $U^* = \{u^*_\pi, \pi \in [0, 1]\}$. The
expected immediate reward, provided in (2), can now be
rewritten as

$$R(\pi) = \max_{u \in U^*} E_C[\gamma(C, u)] = \max_{u \in U^*} [\pi \gamma(h, u) + (1 - \pi)\gamma(l, u)],$$

where $\gamma(s, u)$ denotes the average rate of successful
transmission when the channel state is $s \in \{l, h\}$. Note that, for fixed
$u$, the average rate $\pi \gamma(h, u) + (1 - \pi)\gamma(l, u)$ is linear in $\pi$.
Thus, $R(\pi)$ is given as a point-wise maximum over a family
of linear functions, which is convex [21]. $R(\pi)$ is therefore
convex in $\pi$, establishing the convexity statement in (a).

We next proceed to derive the bounds to $R(\pi)$. From
Equation (2),

$$R(\pi) = \max_{u} E_C[\gamma(C, u)] \ge \max_{\{u : u = \{\eta\}\}} E_C[\gamma(C, u)],$$

where $\{u : u = \{\eta\}\}$ indicates that we are considering rate
adaptation without channel estimation. This explains the last
inequality. Note that without the estimator, the rate adaptation
is solely a function of the belief value $\pi$. Thus, the average rate
achieved under the rate adapter, conditioned on the underlying
channel state, can be expressed simply by indicator functions,
as seen below:

$$\begin{aligned}
\max_{\{u : u = \{\eta\}\}} E_C[\gamma(C, u)]
&= \max_{\eta} \big[ P(C = l)\,\eta \cdot 1(\eta \le \delta) + P(C = h)\,\eta \cdot 1(\eta \le 1) \big] \\
&= \max_{\eta} \eta \big[ P(C = l) \cdot 1(\eta \le \delta) + P(C = h) \cdot 1(\eta \le 1) \big] \\
&= \max\{\delta, \pi\}.
\end{aligned}$$

This establishes the lower bound in (b).

The upper bound in (b) corresponds to the expected
immediate reward when full channel state information is available
at the scheduler.

It is clear from the upper and lower bounds that $\delta \le R(\pi) \le 1$.
Note that when $\pi = 0$ or $\pi = 1$, there is no uncertainty in the
channel, hence $R(0) = \delta$ and $R(1) = 1$. Using these properties,
along with the convexity property of $R(\pi)$, we see that $R(\pi)$ is
monotonically increasing in $\pi$, establishing the monotonicity
of (a). The lemma thus follows. ■

Remark: Here we present some insights into the effect of the
non-zero rate $\delta$ on the channel estimation and rate adaptation
mechanisms by studying the upper and lower bounds to
$R(\pi)$ provided in Lemma 1. The upper bound essentially
corresponds to the case when perfect channel state information
is available at the scheduler at the beginning of each slot.
Here, no channel estimation and rate adaptation is necessary.
The lower bound, on the other hand, corresponds to the case
when the channel estimation stage is eliminated and rate
adaptation is performed solely based on the belief value $\pi$
of the scheduled user.

Fig. 4 plots the lower and upper bounds to $R(\pi)$ for
different values of $\delta$. Note that the lower bound approaches
the upper bound in both directions, i.e., when $\delta \to 0$ or when
$\delta \to 1$. This behavior can be explained as follows: (1) $\delta \to 1$
essentially means that the states of the Markov channel move
closer to each other. This progressively reduces the channel
uncertainty and hence the need for channel estimation (and,
consequently, rate adaptation), essentially bringing the bounds
closer. (2) As $\delta \to 0$, the channel uncertainty increases. At the
same time, the impact of the channel estimator and rate adapter
pair decreases. This is because, as $\delta \to 0$, the loss in immediate
reward due to outage (transmitting at rate 1 when the channel is
in the lower state, which supports only rate $\delta$) is less severe than
the loss due to under-utilization of the channel (transmitting at
rate $\delta$ when the channel is in the higher state, which supports
rate 1), essentially making it optimal for the rate adaptation scheme
to be progressively more aggressive (transmit at rate 1). Thus
channel estimation loses its significance as $\delta \to 0$. This brings
the bounds closer as $\delta \to 0$.

It can be verified that the separation between the lower
and upper bounds is at its peak when $\delta = 0.5$: for a given $\delta$,
the gap $(1 - \delta)\pi + \delta - \max\{\delta, \pi\}$ is largest at $\pi = \delta$, where it
equals $\delta(1 - \delta)$. This, along
with the preceding discussion, indicates the potential for rate
improvement when intelligent channel estimation and rate
adaptation is performed under moderate values of $\delta$.
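The bounds of Lemma 1(b) and the separation between them are easy to check numerically. The sketch below is illustrative only: the function names and the grid resolution are assumptions. It evaluates the gap between the upper bound (1 - delta) * pi + delta and the lower bound max{delta, pi} on a uniform grid of belief values.

```python
def lower_bound(pi, delta):
    """Lower bound of Lemma 1(b): rate adaptation without channel estimation."""
    return max(delta, pi)


def upper_bound(pi, delta):
    """Upper bound of Lemma 1(b): perfect channel state information."""
    return (1.0 - delta) * pi + delta


def max_gap(delta, grid=1001):
    """Largest separation between the two bounds over a uniform belief grid."""
    points = [k / (grid - 1) for k in range(grid)]
    return max(upper_bound(x, delta) - lower_bound(x, delta) for x in points)


for delta in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(delta, max_gap(delta))
```

On this grid the gap vanishes at delta = 0 and delta = 1 and is largest at delta = 0.5, in line with the discussion above.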

IV. Restless Multi-Armed Bandit Processes,
Whittle's Indexability and Index Policies

A direct analysis of the downlink scheduling problem appears
difficult due to the complex nature of the 'exploitation
vs exploration' tradeoff. We therefore establish a connection
between the scheduling problem and the Restless Multiarmed
Bandit Processes (RMBP) [14] and make use of the established
theory behind RMBP in our analysis. We briefly overview
RMBPs and the associated theory of Whittle's indexability in
this section.

Fig. 4. Upper and lower bounds to the average rate of successful transmission.

RMBPs are defined as a family of sequential dynamic 
resource allocation problems in the presence of several com- 
peting, independently evolving projects. In RMBPs, a subset 
of the competing projects are served in each slot. The states 
of all the projects in the system stochastically evolve in time 
based on the current state of the projects and the action 
taken. Once a project is served, a reward dependent on the 
states of the served projects and the action taken is accrued 
by the controller. Hence, the RMBPs are characterized by a 
fundamental tradeoff between decisions guaranteeing high im- 
mediate rewards versus those that sacrifice immediate rewards 
for better future rewards. Solutions to RMBPs are, in general, 
known to be PSPACE-hard ifTTl . 

Under an average constraint on the number of projects
scheduled per slot, a low complexity index policy developed
by Whittle [14], commonly known as Whittle's index policy,
is optimal. Under a stringent constraint on the number of users
scheduled per slot, Whittle's index policy may not exist and,
if it does exist, its optimality properties are, in general, lost.
However, Whittle's index policies, upon existence, are known
to have near-optimal performance in various RMBPs (e.g.,
[13], [16]). For an RMBP, Whittle's index policy exists if and
only if the RMBP satisfies a condition known as Whittle's
indexability [14], defined next.

Consider the following setup: for each project P in the
system, consider a virtual system where, in each slot, the
controller must make one of two decisions: (1) Serve project
P and accrue an immediate reward that is a function of the
state of the project. This reward structure reflects the one in
the original RMBP for project P. (2) Do not serve project P,
i.e., stay passive and accrue an immediate reward for passivity
$\omega$. The state of the project P evolves in the same fashion as it
would in the original RMBP, as a function of its current state
and current action (whether P is served or not in the current
state). Let $D(\omega)$ be the set of states of project P in which it
is optimal to stay passive, where optimality is defined based
on the infinite horizon net reward.

Project P is Whittle indexable if and only if, as $\omega$ increases
from $-\infty$ to $\infty$, the set $D(\omega)$ monotonically expands from
$\emptyset$ to $S$, the state space of project P. The RMBP is Whittle
indexable if and only if all the projects in the RMBP are
Whittle indexable.

For each state, $s$, of a project, Whittle's index, $W(s)$, is
given by the value of $\omega$ for which the net rewards after the
active and passive decisions are the same in the $\omega$-subsidized
virtual system. The notion of indexability gives a consistent
ordering of states with respect to the indices. For instance, if
$W(s_1) > W(s_2)$ and if it is optimal to serve the project at state
$s_2$, then it is optimal to serve the project at $s_1$. This natural
ordering of states based on indices renders the near-optimality
properties to Whittle's index policy (e.g., [15], [16]).

The downlink scheduling problem we have considered is in 
fact an RMBP process. Here, each downlink user, along with 
the belief value of its channel, corresponds to a project in 
the RMBP, and the project is served when the corresponding 
user is scheduled for data transmission. Now, referring to our 
earlier discussion on the RMBPs, we see that Whittle's index 
policy is very attractive from an optimality point of view. 
The attractiveness of the index policy can be attributed to 
the natural ordering of states (and hence projects) based on 
indices, as guaranteed by Whittle's indexability. In the rest 
of the paper, we establish that this advantage carries over to 
the downlink scheduling problem at hand. As a first step in 
this direction, in the next section, we study the scheduling 
problem in Whittle's indexability framework and show that the 
downlink scheduling problem is, in fact, Whittle indexable. 
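Once an index is available for every user, the scheduling step of an index policy reduces to picking the user whose current belief has the largest index. The sketch below illustrates only that selection step; the function name and the two placeholder index functions are hypothetical and are not the closed-form Whittle index derived in the paper.

```python
def schedule_by_index(beliefs, index_fns):
    """Return the user whose current belief value has the largest index.

    beliefs: current belief value of each user.
    index_fns: one callable per user mapping a belief value to an index.
    Ties are broken in favor of the lowest user number.
    """
    scores = [fn(belief) for fn, belief in zip(index_fns, beliefs)]
    return max(range(len(scores)), key=scores.__getitem__)


# Placeholder index functions, for illustration only.
example_indices = [lambda b: b, lambda b: 0.5 + 0.5 * b]
print(schedule_by_index([0.9, 0.6], example_indices))
```

The per-slot cost of this step is linear in the number of users, given the index values.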

V. Whittle's Indexability Analysis of the 
Downlink Scheduling Problem 

In this section, we study the Whittle's indexability of our
joint scheduling and estimation problem. To that end, we first
describe the downlink scheduling setup:

At the beginning of each slot, based on the current belief
value $\pi$ (we drop the user index $i$ in this section since
only one user is considered throughout), the scheduler takes
one of two possible actions: schedules data transmission to
the user (action $a = 1$) or stays idle ($a = 0$). Upon an
idle decision, a subsidy of $\omega$ is obtained. Otherwise, optimal
channel estimation and rate adaptation is carried out, with a
reward equal to $R(\pi)$ (consistent with the immediate reward
seen in previous sections). The belief value is updated based
on the action taken and feedback from the user (upon transmit
decision). This belief update is consistent with that in
Section II. The optimal scheduling policy (henceforth, the
$\omega$-subsidy policy) maximizes the infinite horizon discounted
reward, parameterized by $\omega$. The optimal infinite horizon
discounted reward is given by the Bellman equation [20]



$$V_\omega(\pi) = \max\{[R(\pi) + \beta(\pi V_\omega(p) + (1 - \pi)V_\omega(r))],\ [\omega + \beta V_\omega(Q(\pi))]\}, \tag{5}$$

where, recall from Section II, $Q(\pi)$ is the evolution of the
belief value when the user is not scheduled. The first quantity
inside the max operator corresponds to the infinite horizon
reward when a transmit decision is made in the current slot
and optimal decisions are made in all future slots. The second
element corresponds to an idle decision in the current slot and
optimal decisions in all future slots.
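For intuition, the Bellman equation (5) can be approximated numerically by value iteration on a discretized belief space. The following is a minimal sketch under stated assumptions, not the analysis used in the paper: the function name, the uniform grid, the linear interpolation, and the iteration count are all illustrative choices, and because no closed form for $R(\pi)$ is given here, the caller must supply a stand-in `reward` function (the example uses the lower bound max{delta, pi} of Lemma 1).

```python
def solve_subsidy_problem(omega, p, r, beta, reward, grid=201, iters=500):
    """Approximate V_omega of Eq. (5) by value iteration on a belief grid.

    reward(pi) stands in for R(pi) and is an assumption of this sketch.
    Returns (grid points, V, transmit values, idle values).
    """
    pts = [k / (grid - 1) for k in range(grid)]

    def interp(values, x):
        # Linear interpolation of the grid values at a belief x in [0, 1].
        pos = min(max(x, 0.0), 1.0) * (grid - 1)
        lo = min(int(pos), grid - 2)
        w = pos - lo
        return (1.0 - w) * values[lo] + w * values[lo + 1]

    v = [0.0] * grid
    active = passive = v
    for _ in range(iters):
        vp, vr = interp(v, p), interp(v, r)
        # Transmit: R(pi) + beta * (pi * V(p) + (1 - pi) * V(r)).
        active = [reward(x) + beta * (x * vp + (1.0 - x) * vr) for x in pts]
        # Idle: omega + beta * V(Q(pi)), with Q(pi) = pi * p + (1 - pi) * r.
        passive = [omega + beta * interp(v, x * p + (1.0 - x) * r) for x in pts]
        v = [max(a, b) for a, b in zip(active, passive)]
    return pts, v, active, passive


pts, v, active, passive = solve_subsidy_problem(
    omega=0.6, p=0.8, r=0.2, beta=0.9, reward=lambda x: max(0.3, x))
```

Comparing the returned transmit and idle values across the grid shows at which belief values each action is preferred for the chosen subsidy.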

We note that the indexability analysis in the rest of this
section bears similarities to that in [18], where the authors studied
indexability of a sequential resource allocation problem in a
cognitive radio setting. This problem is mathematically
equivalent to our downlink scheduling problem when $\delta = 0$. We have
already discussed in detail (in Section II) that the structure of
the immediate reward $R(\pi)$ when $\delta > 0$ is very different
than when $\delta = 0$, due to the need for channel estimation
and rate adaptation in the former case. Consequently, in the
Whittle's indexability setup, the infinite horizon discounted
reward $V_\omega(\pi)$ in our problem is different (and more general)
than that in [18], underscoring the significance of our results.

As a crucial preparatory result, we now proceed to show
that the $\omega$-subsidy policy is thresholdable.

A. Thresholdability of the $\omega$-subsidy policy

We first record our result on the convexity property of
the infinite horizon discounted reward, $V_\omega(\pi)$, of (5) in the
following proposition.

Proposition 1. The infinite horizon discounted reward, $V_\omega(\pi)$,
is convex in $\pi \in [0, 1]$.

Proof: We first consider the discounted reward for the finite
horizon $\omega$-subsidy problem. We let $v^1(\pi) = R(\pi)$ and $v^0(\pi) = \omega$
represent the immediate reward corresponding to active and
idle decisions, respectively. The reward function associated
with the $M$-stage finite horizon process is expressed as

$$V_M(\pi[0]) = \max_{a[t],\ t = 0, \ldots, M-1} E\Big[ \sum_{t=0}^{M-1} \beta^t v^{a[t]}(\pi[t]) \,\Big|\, \pi[0] \Big].$$

Let $V_{\omega,t}(\pi)$ be the reward at time $t$ with belief value $\pi[t] = \pi$.
Hence $V_M(\pi[0]) = V_{\omega,0}(\pi[0])$ and the last stage value
function $V_{\omega,M-1}(\pi[M-1])$ is given by

$$V_{\omega,M-1}(\pi[M-1]) = \max_{a[M-1]} \{ v^{a[M-1]}(\pi[M-1]) \} = \max\{\omega, R(\pi[M-1])\}.$$

Therefore, $V_{\omega,M-1}(\pi)$ is convex in $\pi$ since it is the
maximum of a constant and a convex function. For any time
$0 \le t < M - 1$, the Bellman equation [20] can be written
as

$$V_{\omega,t}(\pi[t]) = \max\{V^0_{\omega,t}(\pi[t]),\ V^1_{\omega,t}(\pi[t])\},$$

where

$$V^0_{\omega,t}(\pi) = \omega + \beta V_{\omega,t+1}(Q(\pi)), \tag{6}$$
$$V^1_{\omega,t}(\pi) = R(\pi) + \beta(\pi V_{\omega,t+1}(p) + (1 - \pi)V_{\omega,t+1}(r)). \tag{7}$$

Suppose now $V_{\omega,t+1}(\pi)$ is convex in $\pi$. If $a[t] = 1$, it
is clear from (7) that $V^1_{\omega,t}(\pi)$ is a convex function of $\pi$ since
it is a summation of a convex function and a linear function
of $\pi$. If $a[t] = 0$, $V^0_{\omega,t}(\pi)$, expressed in (6), is also a convex
function, because the composition of the convex function $V_{\omega,t+1}(\cdot)$
and the linear function $Q(\pi)$ is convex [21]. Therefore $V_{\omega,t}(\pi)$
is convex in $\pi$ as the maximum of two convex functions. By
induction, the convexity of $V_{\omega,t}(\pi)$ is thus established.

Since $V_M(\pi) = V_{\omega,0}(\pi)$, $V_M(\pi)$ is convex in $\pi$. For a
discounted problem with bounded reward per slot, the infinite
horizon reward is the limit of the finite horizon reward [20].
Therefore $V_\omega(\pi) = \lim_{M \to \infty} V_M(\pi)$. Since the point-wise
limit of convex functions is convex [21],
$V_\omega(\pi)$ is a convex function of $\pi$. ■

In the next proposition, we show that the optimal $\omega$-subsidy
policy is a threshold policy.

Proposition 2. The optimal $\omega$-subsidy policy is thresholdable
in the belief space $\pi$. Specifically, there exists a threshold
$\pi^*(\omega)$ such that the optimal action $a$ is 1 if the current belief
$\pi > \pi^*(\omega)$ and the optimal action $a$ is 0, otherwise. The value
of the threshold $\pi^*(\omega)$ depends on the subsidy $\omega$, partially
characterized below.

(i) If $\omega \ge 1$, $\pi^*(\omega) = 1$;

(ii) If $\omega \le \delta$, $\pi^*(\omega) = \kappa$ for some arbitrary $\kappa < 0$;

(iii) If $\delta < \omega < 1$, $\pi^*(\omega)$ takes value within interval $(0, 1)$.

Proof: Consider the Bellman equation (5). Let $V^1_\omega(\pi)$ be the reward corresponding to the transmit decision and $V^0_\omega(\pi)$ be the reward corresponding to the idle decision, i.e.,

$$V^1_\omega(\pi) = R(\pi) + \beta\big(\pi V_\omega(p) + (1-\pi) V_\omega(r)\big), \qquad (8)$$

$$V^0_\omega(\pi) = \omega + \beta V_\omega(Q(\pi)) = \omega + \beta V_\omega(\pi p + (1-\pi) r). \qquad (9)$$

It is clear from the Bellman equation (5) that the optimal action depends on the relationship between $V^1_\omega(\pi)$ and $V^0_\omega(\pi)$, presented as follows.

Case (i). If $\omega \ge 1$, since $R(\pi) \le 1$, in each slot the immediate reward for being idle always dominates the reward for being active. Hence it is optimal to always stay idle. We can thus set the threshold to 1.

Case (ii). If $\omega < \delta$, then for any $\pi \in [0, 1]$, we have

$$V^0_\omega(\pi) = \omega + \beta V_\omega(\pi p + (1-\pi) r) < R(\pi) + \beta\big(\pi V_\omega(p) + (1-\pi) V_\omega(r)\big),$$

where the inequality is due to $\delta \le R(\pi)$ along with Jensen's inequality [21] and the convexity of $V_\omega(\pi)$ established in the preceding proposition. Hence, it is optimal to stay active. Consistent with the threshold definition, we can set $\pi^*(\omega) = \kappa$ for any $\kappa < 0$.

Case (iii). If $\delta \le \omega < 1$, then at the extreme values of belief,

$$V^0_\omega(0) = \omega + \beta V_\omega(r) \ge \delta + \beta V_\omega(r) = V^1_\omega(0),$$

$$V^0_\omega(1) = \omega + \beta V_\omega(p) < 1 + \beta V_\omega(p) = V^1_\omega(1).$$

Note that the relationship between $V^0_\omega(\pi)$ and $V^1_\omega(\pi)$ is reversed at the end points 0 and 1, and they are both convex functions of $\pi$. Thus, there must exist a threshold $\pi^*(\omega)$ within $(0, 1)$ such that $a$ equals 1 whenever $\pi > \pi^*(\omega)$. ■
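The threshold structure can be explored numerically. The following is a minimal sketch, not the authors' procedure: it runs value iteration on a uniform belief grid using the idle and active branches (9) and (8). The reward `R_demo`, the grid size, and all parameter values are hypothetical stand-ins chosen only so that the reward is convex and increasing with $R(0) = \delta$ and $R(1) = 1$.

```python
def interp(x, values):
    """Piecewise-linear interpolation of `values` sampled on a uniform grid over [0, 1]."""
    n = len(values) - 1
    k = min(int(x * n), n - 1)
    w = x * n - k
    return (1.0 - w) * values[k] + w * values[k + 1]


def solve_subsidy_problem(omega, p, r, beta, R, n=200, tol=1e-10):
    """Value iteration for the omega-subsidy problem on a uniform belief grid.

    Returns (grid, V, active), where active[k] is True when transmitting is
    strictly better than idling at belief grid[k].
    """
    grid = [k / n for k in range(n + 1)]
    V = [0.0] * (n + 1)
    while True:
        Vp, Vr = interp(p, V), interp(r, V)
        # Idle branch, eq. (9): collect the subsidy, belief moves to Q(pi).
        idle = [omega + beta * interp(x * p + (1.0 - x) * r, V) for x in grid]
        # Active branch, eq. (8): the next belief is p or r.
        active = [R(x) + beta * (x * Vp + (1.0 - x) * Vr) for x in grid]
        V_new = [max(a, b) for a, b in zip(idle, active)]
        converged = max(abs(a - b) for a, b in zip(V_new, V)) < tol
        V = V_new
        if converged:
            return grid, V, [a > i for a, i in zip(active, idle)]


# Hypothetical reward (assumption): convex, increasing, R(0) = delta, R(1) = 1.
DELTA = 0.2


def R_demo(pi):
    return DELTA + (1.0 - DELTA) * pi ** 2


grid, V, active = solve_subsidy_problem(omega=0.6, p=0.8, r=0.2, beta=0.8, R=R_demo)
idle_beliefs = [x for x, a in zip(grid, active) if not a]
print("largest grid belief at which idling is preferred:", max(idle_beliefs))
```

Under these assumptions, a subsidy above 1 should leave every grid point idle and a subsidy below $\delta$ should make every grid point active, mirroring cases (i) and (ii).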

B. Whittle's Indexability of Downlink Scheduling

Having established in Proposition 2 that the $\omega$-subsidy policy is thresholdable, Whittle's indexability, defined in Section IV, is re-interpreted for the downlink scheduling problem as follows: the downlink scheduling problem is Whittle indexable if the threshold boundary $\pi^*(\omega)$ is monotonically increasing with the subsidy $\omega$.

Using our discussion in Section IV, the index of the belief value $\pi$, i.e., $W(\pi)$, is the infimum value of the subsidy $\omega$ such that it is optimal to stay idle, i.e.,

$$W(\pi) = \inf\{\omega : V^0_\omega(\pi) \ge V^1_\omega(\pi)\} = \inf\{\omega : \pi^*(\omega) = \pi\}. \qquad (10)$$

To establish indexability, we need to investigate the infinite horizon discounted reward $V_\omega(\pi)$, given by (5). We can observe from (5) that, given the values of $V_\omega(p)$ and $V_\omega(r)$, $V_\omega(\pi)$ can be calculated for all $\pi \in [0, 1]$. Let $\pi^0$ denote the steady state probability of being in state $h$. The next lemma provides closed form expressions for $V_\omega(p)$ and $V_\omega(r)$ and is critical to the proof of indexability.

Lemma 2. The discounted rewards $V_\omega(p)$ and $V_\omega(r)$ can be expressed as follows.

Case 1: $p > r$ (positive correlation)

$$V_\omega(p) = \begin{cases} \dfrac{R(p) + \beta(1-p)V_\omega(r)}{1-\beta p} & \text{if } \pi^*(\omega) < p, \\[1ex] \dfrac{\omega}{1-\beta} & \text{if } \pi^*(\omega) \ge p, \end{cases}$$

$$V_\omega(r) = \begin{cases} \sum_{t=0}^{\infty} \beta^t R(Q^t(r)) & \text{if } \pi^*(\omega) < r, \\[1ex] \Theta & \text{if } r \le \pi^*(\omega) < \pi^0, \\[1ex] \dfrac{\omega}{1-\beta} & \text{if } \pi^*(\omega) \ge \pi^0. \end{cases}$$

Case 2: $p < r$ (negative correlation)

$$V_\omega(p) = \begin{cases} \sum_{t=0}^{\infty} \beta^t R(Q^t(p)) & \text{if } \pi^*(\omega) < p, \\[1ex] \dfrac{\omega + \beta R(Q(p)) + \beta^2 (1-Q(p)) V_\omega(r)}{1-\beta^2 Q(p)} & \text{if } p \le \pi^*(\omega) < Q(p), \\[1ex] \dfrac{\omega}{1-\beta} & \text{if } \pi^*(\omega) \ge Q(p), \end{cases}$$

$$V_\omega(r) = \begin{cases} \dfrac{R(r) + \beta r V_\omega(p)}{1-\beta(1-r)} & \text{if } \pi^*(\omega) < r, \\[1ex] \dfrac{\omega}{1-\beta} & \text{if } \pi^*(\omega) \ge r. \end{cases}$$

$$\Theta = \frac{\big(1-\beta^{L(r,\pi^*(\omega))}\big)\,\omega + (1-\beta)\,\beta^{L(r,\pi^*(\omega))}\Big[R\big(Q^{L(r,\pi^*(\omega))}(r)\big) + \beta\, Q^{L(r,\pi^*(\omega))}(r)\, V_\omega(p)\Big]}{(1-\beta)\Big[1 - \beta^{L(r,\pi^*(\omega))+1}\big(1 - Q^{L(r,\pi^*(\omega))}(r)\big)\Big]} \qquad (11)$$

The expression of $\Theta$ is given by Equation (11), where $Q^n$ denotes the $n$-th iteration of $Q$ and $L(\pi, \pi^*(\omega))$ is a function of $\pi$ and $\pi^*(\omega)$. Their expressions are given in Appendix A. From the above expressions, the closed form of $V_\omega(p)$ and $V_\omega(r)$ can be readily obtained. The explicit expressions are space-consuming and are therefore moved to Appendix A.

Proof: The derivation of $V_\omega(p)$ and $V_\omega(r)$ follows from substituting $p$ and $r$ in Equation (5). Together with the expression of $Q(\pi)$ given in Section II, the expressions of $V_\omega(p)$ and $V_\omega(r)$ can be obtained. For details, please refer to Appendix A. ■

We note that the value function expression depends on the correlation type of the Markov chain, because the transition function $Q(\pi)$ given in Section II behaves differently depending on the correlation type of the chain.

The closed form expression of the value function given by the previous lemma serves as a useful tool for us to establish indexability, which is given by the next proposition.

Proposition 3. The threshold value $\pi^*(\omega)$ is strictly increasing with $\omega$. Therefore, the problem is Whittle indexable.

Proof: The proof of indexability follows the lines of [18]. Details are provided in Appendix B. ■

VI. Whittle's Index Policy

VII. Whittle's Index Policy and Numerical Performance Analysis

In this section, we explicitly characterize Whittle's index policy for the downlink scheduling problem. For user $i$, let $\pi_i^0$ denote the steady state probability of being in state $h_i$, and let $V_{i,\omega}(\pi_i)$ denote the reward function for its $\omega$-subsidy problem in (5). We first characterize Whittle's index as follows.

Proposition 4. For user $i$, the index value at state $\pi_i$, i.e., $W_i(\pi_i)$, is characterized as follows.

Case 1. Positively correlated channel ($p_i > r_i$)

$$W_i(\pi_i) = \begin{cases} R_i(\pi_i) & \text{if } \pi_i \ge p_i, \\[1ex] \dfrac{\beta \pi_i R_i(p_i) + (1-\beta p_i) R_i(\pi_i)}{1 + \beta(\pi_i - p_i)} & \text{if } \pi_i^0 \le \pi_i < p_i, \\[1ex] \big[R_i(\pi_i) - \beta R_i(Q_i(\pi_i))\big] + \beta\big[\pi_i - \beta Q_i(\pi_i)\big] V_{i,W_i(\pi_i)}(p_i) + \beta\big[(1-\pi_i) - \beta(1-Q_i(\pi_i))\big] V_{i,W_i(\pi_i)}(r_i) & \text{if } \pi_i < \pi_i^0. \end{cases}$$

Case 2. Negatively correlated channel ($p_i < r_i$)

$$W_i(\pi_i) = \begin{cases} R_i(\pi_i) & \text{if } \pi_i \ge r_i, \\[1ex] \dfrac{(1-\beta)\big[R_i(\pi_i) + \beta(1-\pi_i) V_{i,W_i(\pi_i)}(r_i)\big]}{1-\beta\pi_i} & \text{if } Q_i(p_i) \le \pi_i < r_i, \\[1ex] (1-\beta)\Big[R_i(\pi_i) + \beta\big[\pi_i V_{i,W_i(\pi_i)}(p_i) + (1-\pi_i) V_{i,W_i(\pi_i)}(r_i)\big]\Big] & \text{if } \pi_i^0 \le \pi_i < Q_i(p_i), \\[1ex] \big[R_i(\pi_i) - \beta R_i(Q_i(\pi_i))\big] + \beta\big[\pi_i - \beta Q_i(\pi_i)\big] V_{i,W_i(\pi_i)}(p_i) + \beta\big[(1-\pi_i) - \beta(1-Q_i(\pi_i))\big] V_{i,W_i(\pi_i)}(r_i) & \text{if } \pi_i < \pi_i^0. \end{cases}$$

Proof: The derivation of the index value follows from substituting the expressions of $V_{i,\omega}(p_i)$ and $V_{i,\omega}(r_i)$ (given in Lemma 2) into the definition of the index value. Details of the proof are provided in Appendix C. ■

Remark: Notice that Proposition 4 does not give the closed form expression for $W_i(\pi_i)$. However, since the closed form expressions of the value functions $V_{i,W_i(\pi_i)}(p_i)$ and $V_{i,W_i(\pi_i)}(r_i)$ are derived in Lemma 2, closed form expressions of $W_i(\pi_i)$ can be easily calculated and are given in Appendix C. We now introduce Whittle's index policy.

Whittle's Index Policy: In each slot, with belief values $\pi_1, \ldots, \pi_N$, the user $I$ with the highest index value $W_i(\pi_i)$ is scheduled for transmission, i.e., $I = \arg\max_i W_i(\pi_i)$.

Fig. 5. Index value evolution of user i, with W_i[0] = 0.3. (a) Positive correlation, p_i = 0.8, r_i = 0.2; (b) Negative correlation, p_i = 0.2, r_i = 0.8.

Fig. 6. Immediate reward versus π. (Horizontal axis: belief value π.)

Note that, from the definition of indexability, the index value $W_i(\pi_i)$ monotonically increases with $\pi_i$. Therefore, when the Markovian channels have the same Markovian structure and vary independently across users (hence the state-index mappings are the same across users), Whittle's index policy essentially becomes the greedy policy: schedule the user with the highest belief value.
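One slot of the index policy can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the index functions passed in are hypothetical stand-ins for the expressions of Proposition 4, the `ack` callback is a hypothetical model of the end-of-slot feedback, and the belief of the scheduled user is assumed to reset to $p_i$ or $r_i$ as suggested by the active branch of the Bellman equation.

```python
def idle_update(pi, p, r):
    """Belief of an unscheduled user after one slot: Q(pi) = pi*p + (1 - pi)*r."""
    return pi * p + (1.0 - pi) * r


def index_policy_step(beliefs, params, index_fns, ack):
    """Run one slot of the index policy.

    beliefs   : list of current beliefs pi_i
    params    : list of (p_i, r_i) pairs
    index_fns : list of callables W_i(pi) (hypothetical stand-ins)
    ack       : callable(user) -> True if feedback indicates the high state
    Returns (scheduled_user, next_beliefs).
    """
    # Schedule the user with the largest index value.
    scheduled = max(range(len(beliefs)), key=lambda i: index_fns[i](beliefs[i]))
    next_beliefs = []
    for i, (pi, (p, r)) in enumerate(zip(beliefs, params)):
        if i == scheduled:
            # Assumption: feedback reveals the state, so the belief becomes p or r.
            next_beliefs.append(p if ack(i) else r)
        else:
            next_beliefs.append(idle_update(pi, p, r))
    return scheduled, next_beliefs


# Identical channels: any common increasing index reduces to the greedy rule.
user, beliefs = index_policy_step(
    beliefs=[0.3, 0.7, 0.5],
    params=[(0.8, 0.2)] * 3,
    index_fns=[lambda x: x] * 3,
    ack=lambda i: True,
)
print(user, beliefs)
```

With identical channels the identity function is used as the index here only because, as noted above, any common increasing index selects the user with the highest belief.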

Fig. 5 plots an example of the index value evolution for positively correlated and negatively correlated channels when they stay idle, i.e., are not scheduled for transmission. We see that, for the positively correlated channel, the index value behaves monotonically, while, for the negatively correlated channel, the index value oscillates. This resembles the evolution of the belief values, which, as proven in Lemma 3 in Appendix A, approach steady state monotonically for the positively correlated channel, and with oscillation for the negatively correlated channel. This resemblance in Fig. 5 is expected since, from Proposition 3, we can infer that the index value monotonically increases with the belief value. Thus, in essence, from Proposition 3 and Fig. 5, we see that the index value captures the underlying dynamics of the Markovian channel.

VIII. Numerical Performance Analysis 

A. Model for Simulation 

In this section, we study, via numerical evaluations, the performance of Whittle's index policy, henceforth simply the index policy, for joint estimation and scheduling in our downlink system. We consider the specific class of estimator and rate adapter structure, with pilot-aided training, discussed in Section II and illustrated in Fig. 3. We consider a fading channel with the fading coefficients quantized into two levels to reflect the two states of the Markov chain. Additive noise is assumed to be white Gaussian. The channel input-output model is given by $Y = hX + \epsilon$, where $X$ and $Y$ correspond to the transmitted and received signals, respectively, $h$ is the complex fading coefficient and $\epsilon$ is the complex Gaussian, unit variance additive noise. Conditioned on $h$, the Shannon capacity of the channel is given by $R = \log(1 + |h|^2)$. We quantize the fading coefficients such that the allowed rate at the lower state is $\delta = 0.2$ for all users. The channel state, represented by the fading coefficient, evolves as a Markov chain with fading block length $T$.

Fig. 7. Performance of the index policy in comparison with that of the optimal policy. System parameters used: N = 5, {p_1 = 0.2, r_1 = 0.75}, {p_2 = 0.6, r_2 = 0.25}, {p_3 = 0.8, r_3 = 0.3}, {p_4 = 0.4, r_4 = 0.7}, {p_5 = 0.65, r_5 = 0.55}; Fading block length: T = 20.

We consider a class of Linear Minimum Mean Square Error (LMMSE) estimators [22], denoted as $\Phi$. LMMSE estimators are attractive because, with additive white Gaussian noise, they can be characterized in closed form [22] and, hence, can be conveniently used in simulation. Let $\phi_\pi$ denote the optimal LMMSE estimator with prior $\{\pi, 1-\pi\}$. We let $\Phi$ denote the set of LMMSE estimators optimized for various values of $\pi$.

B. Immediate Reward Structure 

We now study the structure of the immediate reward $R(\pi)$. Note that $R(\pi)$ is optimized over the class of estimators $\Phi$. Fig. 6 illustrates $R(\pi)$, in comparison with the upper and lower bounds derived in Lemma 2, for two values of block length $T$. As established in Lemma 2, $R(\pi)$ shows a convex increasing structure and takes values within the bounds. Note that $R(\pi)$ also increases with $T$, since a larger $T$ provides more channel uses for channel probing and data transmission.

C. Near-optimal Performance of Whittle's Index Policy

We proceed to evaluate the performance of the index policy and compare it with the optimal policy. In Fig. 7, we compare the expected rewards $V^M_{\mathrm{opt}}$ and $V^M_{\mathrm{index}}$ that, respectively, correspond to the optimal finite $M$-horizon policy and the index policy, for increasing horizon length $M$ and randomly generated system parameters. The value of $V^M_{\mathrm{opt}}$ is obtained via brute-force search over the finite horizon. Fig. 7 illustrates the near optimal performance of the index policy. Also, as expected, the higher the value of $\beta$, the higher the expected reward.

Table I presents the performance of the index policy in a larger perspective. Here, with randomly generated system parameters, the infinite horizon reward under the index policy is compared with those of the optimal policy and a policy that 'throws away' the feedback from the scheduled user. Let $V_{\mathrm{nofb}}$ denote the reward under this 'no feedback' policy. The infinite horizon rewards are obtained as limits of the finite horizon rewards until 1% convergence is achieved. The high values of the quantity $\%\mathrm{gain} = \frac{V_{\mathrm{index}} - V_{\mathrm{nofb}}}{V_{\mathrm{opt}} - V_{\mathrm{nofb}}} \times 100\%$, in addition to underscoring the near-optimality of the index policy, also signify the high system-level gains from exploiting the channel memory using the end-of-slot feedback.

TABLE I
Illustration of the gains associated with exploiting channel memory.

N              V_opt     V_index   V_nofb    %gain
4    0.6337    1.6289    1.6289    1.4887    100 %
4    0.5896    1.5977    1.5866    1.2888    96.4045 %
4    0.6673    1.6537    1.6319    1.4342    90.0500 %
5    0.4537    0.9854    0.9854    0.9299    100 %
5    0.6082    1.6132    1.6072    1.4777    95.5518 %
5    0.6537    2.3728    2.3725    2.1494    99.8697 %
5    0.5397    1.6330    1.6330    1.5961    100 %

Fig. 8. Illustration of the influence of channel 'memory', (p − r), on the performance of the index policy and baseline policies, when β = 0.6. (Horizontal axis: memory := p − r.)
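As a sanity check on Table I, the %gain column can be recomputed from the reported rewards. The sketch below assumes the numeric columns of each row are ordered as N, an unlabeled second parameter (omitted here), V_opt, V_index, V_nofb and %gain; that ordering is an inference from the gain formula, not something the table residue states explicitly.

```python
# (N, V_opt, V_index, V_nofb, reported %gain), transcribed from Table I.
# Column assignment is an assumption inferred from the %gain definition.
ROWS = [
    (4, 1.6289, 1.6289, 1.4887, 100.0),
    (4, 1.5977, 1.5866, 1.2888, 96.4045),
    (4, 1.6537, 1.6319, 1.4342, 90.0500),
    (5, 0.9854, 0.9854, 0.9299, 100.0),
    (5, 1.6132, 1.6072, 1.4777, 95.5518),
    (5, 2.3728, 2.3725, 2.1494, 99.8697),
    (5, 1.6330, 1.6330, 1.5961, 100.0),
]


def percent_gain(v_opt, v_index, v_nofb):
    """Share of the feedback gain (optimal over no-feedback) captured by the index policy."""
    return 100.0 * (v_index - v_nofb) / (v_opt - v_nofb)


for n_users, v_opt, v_index, v_nofb, reported in ROWS:
    recomputed = percent_gain(v_opt, v_index, v_nofb)
    print(f"N={n_users}: recomputed {recomputed:.2f} %, reported {reported:.4f} %")
```

Small differences between recomputed and reported values are expected because the table entries are rounded to four decimals.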

In Fig. 8, we study the effect of the channel 'memory' on the performance of various baseline policies. We consider five users with statistically identical but independently varying channels. Thus $p_i = p$, $r_i = r$, $i \in \{1, \ldots, 5\}$. We define the channel 'memory' as the difference $p - r$ and increase the memory by increasing $p$ from 0.5 to 1 while maintaining $r = 1 - p$. Note that, with this approach, $p + r = 1$. Under this condition, the steady state probability that a channel is in the higher state $h$ is kept constant under varying channel memory. This, essentially, provides a degree of fairness between systems with different channel memories. Fig. 8 compares the rewards $V_{\mathrm{opt}}$, $V_{\mathrm{index}}$ and $V_{\mathrm{nofb}}$ that respectively correspond to the optimal policy, the index policy, and the 'no feedback' policy introduced earlier, for increasing channel memory. Note that when $p = r$, the channel of each user evolves i.i.d. across time, with no information contained in the channel state feedback. Thus the policy that throws away this feedback achieves the same performance as the optimal policy that optimally uses this feedback, i.e., $V_{\mathrm{nofb}} = V_{\mathrm{opt}}$ when $p = r$. Also, since the channels are i.i.d. across users, when $p = r$, the index policy simplifies to a 'randomized' policy that schedules randomly and uniformly across users, in effect mirroring the 'no feedback' policy in this setting. This explains $V_{\mathrm{index}} = V_{\mathrm{nofb}}$ when $p = r$. As the channel memory increases, the significance of the channel state feedback increases, resulting in an increasing gap between the policies that use this feedback (optimal and Whittle's index policies) and the 'no feedback' policy.
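The claim that the steady state probability stays fixed while the memory varies can be checked directly. The sketch below uses the standard stationary probability of a two-state Markov chain, $r / (1 + r - p)$, with $p$ the probability of staying in the high state and $r$ the probability of moving from the low to the high state; the sampled values of $p$ are illustrative choices.

```python
def steady_state_high(p, r):
    """Stationary probability of the high state for a two-state chain
    with P(high -> high) = p and P(low -> high) = r."""
    return r / (1.0 + r - p)


# Sweep the memory p - r with r = 1 - p (p = 1 is excluded: the chain is then reducible).
for p in (0.5, 0.6, 0.7, 0.8, 0.9):
    r = 1.0 - p
    print(f"memory={p - r:.1f}  steady state={steady_state_high(p, r):.3f}")
```

Every line of the sweep reports a steady state probability of one half, so systems with different memory are compared at the same long-run channel quality.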

Fig. 8, along with Table I, shows that exploiting channel memory for opportunistic scheduling can result in significant performance gains, and that almost all of these gains can be realized using the easy-to-implement index policy.

IX. Conclusion 

In this paper, we have studied downlink multiuser scheduling under a Markov-modeled channel. We considered the scenario in which the channel state information is not perfectly known at the scheduler, essentially requiring a joint design of user selection, channel estimation and rate adaptation. This calls for a two-stage optimization: (1) Within each slot, the channel estimation and rate adaptation are optimized to obtain an optimal transmission rate in the scheduling slot; (2) Across scheduling slots, users are selected to maximize the infinite horizon discounted reward. We formulated the scheduling problem as a partially observable Markov decision process with the classic 'exploitation versus exploration' trade-off. We then linked the problem to a restless multi-armed bandit process and conducted a Whittle's indexability analysis. By obtaining structural properties of the optimal reward within the indexability setup, we showed that the downlink scheduling problem is Whittle indexable. We then explicitly characterized Whittle's index policy and studied the performance of this policy using extensive numerical experiments, which suggest that the index policy has near-optimal performance and that significant system-level gains can be realized by exploiting the channel memory for joint channel estimation and scheduling.



Appendix A 
Proof of Lemma 2

We first establish structural properties of the belief update when a user stays idle. Suppose a user has the initial belief value $\pi[0]$ and stays idle at all times. The belief value at the $t$-th slot is then given by $\pi[t] = Q^t(\pi[0])$, where $Q^t$ is the $t$-th iteration of the function $Q$, given by

$$Q^t(\pi) = \frac{r - (p-r)^t\big(r - (1+r-p)\pi\big)}{1+r-p}. \qquad (12)$$

We let $\pi^0$ be the steady state probability of the two-state channel being in the higher state, i.e.,

$$\pi^0 = \frac{r}{1+r-p}.$$

It is clear that $\pi^0 = \lim_{t\to\infty} Q^t(\pi)$. An example of the belief evolution when a user stays idle is depicted in Fig. 9. This figure shows that, when staying idle, the belief value approaches steady state monotonically for a positively correlated channel and approaches steady state with oscillation for a negatively correlated channel. The structural properties of $Q^t(\pi[0])$ are critical to the rest of the proof and are recorded in the following lemma.

Fig. 9. Evolution of belief values under consecutive idle decisions. (a) Positive correlation, p = 0.8, r = 0.2; (b) Negative correlation, p = 0.2, r = 0.8.



Lemma 3.

(i) For a positively correlated channel (i.e., $p > r$), $\pi[t]$ converges to the steady state $\pi^0$ monotonically. For a negatively correlated channel (i.e., $p < r$), $\pi[t]$ converges to the steady state $\pi^0$ with oscillation and a monotonically converging envelope.

(ii) $\min\{p, r\} \le Q^t(\pi[0]) \le \max\{p, r\}$ for all $t = 1, 2, \ldots$ and $\pi[0] \in [0, 1]$.

Proof: (i) Since we have $0 < p - r < 1$ for a positively correlated channel and $-1 < p - r < 0$ for a negatively correlated channel, it is clear from expression (12) that $\pi[t]$ converges to the steady state $\pi^0$ monotonically in the former case, and approaches the steady state $\pi^0$ with oscillation and a monotonically converging envelope in the latter.

(ii) Since we have established part (i), it suffices to check that the first step transition satisfies $\min\{p, r\} \le Q(\pi) \le \max\{p, r\}$ for all $\pi$, as shown below. From (12) with $t = 1$,

$$Q(\pi) = \frac{r - (p-r)\big(r - (1+r-p)\pi\big)}{1+r-p}.$$

For a positively correlated channel, since $p - r > 0$,

$$Q(\pi) \ge \frac{r - (p-r)r}{1+r-p} = r,$$

$$Q(\pi) \le \frac{r - (p-r)\big(r - (1+r-p)\big)}{1+r-p} = \frac{p(1-p+r)}{1+r-p} = p.$$

For a negatively correlated channel, since $p - r < 0$,

$$Q(\pi) \le \frac{r - (p-r)r}{1+r-p} = r,$$

$$Q(\pi) \ge \frac{r - (p-r)\big(r - (1+r-p)\big)}{1+r-p} = \frac{p(1-p+r)}{1+r-p} = p.$$

The lemma is thus proved. ■
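The two parts of the lemma are easy to check numerically. The sketch below compares the closed form (12) against repeated application of the one-step update $Q(\pi) = \pi p + (1-\pi) r$; the parameter values are the ones quoted in the Fig. 9 caption, and the initial belief 0.3 is an illustrative choice.

```python
def Q(pi, p, r):
    """One-step idle belief update."""
    return pi * p + (1.0 - pi) * r


def Q_t(pi, p, r, t):
    """Closed form (12) for the t-step idle belief update."""
    return (r - (p - r) ** t * (r - (1.0 + r - p) * pi)) / (1.0 + r - p)


def idle_trajectory(pi0, p, r, steps):
    """Beliefs pi[0], ..., pi[steps] under consecutive idle decisions."""
    traj = [pi0]
    for _ in range(steps):
        traj.append(Q(traj[-1], p, r))
    return traj


for label, p, r in (("positive", 0.8, 0.2), ("negative", 0.2, 0.8)):
    traj = idle_trajectory(0.3, p, r, 6)
    print(label, [round(x, 4) for x in traj])
```

For the positively correlated pair the printed beliefs climb steadily toward one half, while for the negatively correlated pair they alternate around it with shrinking amplitude.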

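We then define $L(\pi, \pi^*)$ as the time needed for the belief value of a user to exceed $\pi^*$ from below, starting from the initial value $\pi$. Formally,

$$L(\pi, \pi^*) = \min\{t : Q^t(\pi) > \pi^*\}.$$

Using Lemma 3 and expression (12), $L(\pi, \pi^*)$ can be calculated as follows.

• Positive correlation ($p > r$)

$$L(\pi, \pi^*) = \begin{cases} 0 & \text{if } \pi > \pi^*, \\[1ex] \left\lfloor \log_{p-r} \dfrac{r - (1+r-p)\pi^*}{r - (1+r-p)\pi} \right\rfloor + 1 & \text{if } \pi \le \pi^* < \pi^0, \\[1ex] \infty & \text{if } \pi \le \pi^* \text{ and } \pi^* \ge \pi^0. \end{cases}$$

• Negative correlation ($p < r$)

$$L(\pi, \pi^*) = \begin{cases} 0 & \text{if } \pi > \pi^*, \\ 1 & \text{if } \pi \le \pi^* \text{ and } Q(\pi) > \pi^*, \\ \infty & \text{if } \pi \le \pi^* \text{ and } Q(\pi) \le \pi^*. \end{cases}$$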
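The crossing time can also be computed by direct iteration, which is a convenient cross-check on the case analysis. In the sketch below, `first_passage_time` iterates the idle update and reports infinity when no crossing occurs within `max_steps` (a practical cut-off, which is an assumption of this sketch), while `first_passage_closed_form` evaluates the floor-of-logarithm expression obtained from (12) for the positively correlated case. Function names and test values are illustrative.

```python
import math


def first_passage_time(pi, pi_star, p, r, max_steps=10_000):
    """L(pi, pi*): smallest t >= 0 with Q^t(pi) > pi*, or math.inf if none is found."""
    x = pi
    for t in range(max_steps + 1):
        if x > pi_star:
            return t
        x = x * p + (1.0 - x) * r  # one idle step
    return math.inf


def first_passage_closed_form(pi, pi_star, p, r):
    """Closed form of L(pi, pi*) for a positively correlated channel (p > r)."""
    if pi > pi_star:
        return 0
    if pi_star >= r / (1.0 + r - p):  # threshold at or above the steady state
        return math.inf
    ratio = (r - (1.0 + r - p) * pi_star) / (r - (1.0 + r - p) * pi)
    return math.floor(math.log(ratio) / math.log(p - r)) + 1


print(first_passage_time(0.2, 0.45, 0.8, 0.2))         # positive correlation
print(first_passage_closed_form(0.2, 0.45, 0.8, 0.2))  # same quantity, closed form
print(first_passage_time(0.3, 0.6, 0.2, 0.8))          # negative correlation
```

The closed form should be used with care when the logarithm ratio lands exactly on an integer, since floating-point rounding can then shift the floor by one; the iterative version avoids that issue.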

We shall refer to the 'active set' as the set of belief values for which the optimal decision is to transmit. The 'idle set' denotes the set of belief values for which the optimal decision is to stay idle. We proceed to derive the value functions $V_\omega(p)$ and $V_\omega(r)$ based on the value of $\pi^*(\omega)$.

(1) Positive correlation ($p > r$).

• When $\pi^*(\omega) \ge p$, the belief value $p$ is in the 'idle set'. From Lemma 3(ii), if $\pi[0] = p$, the system stays idle. Hence the reward function is expressed as

$$V_\omega(p) = \omega + \beta\omega + \beta^2\omega + \cdots = \frac{\omega}{1-\beta}.$$

• When $\pi^*(\omega) < p$, the belief value $p$ is in the 'active set'. Hence, from the Bellman equation in (5),

$$V_\omega(p) = R(p) + \beta\big(p V_\omega(p) + (1-p) V_\omega(r)\big).$$

Rearranging the terms yields

$$V_\omega(p) = \frac{R(p) + \beta(1-p) V_\omega(r)}{1-\beta p}.$$

• When $\pi^*(\omega) < r$, the value $r$ is in the 'active set'. From Lemma 3(ii), regardless of the scheduling decision, the belief values $\pi[t]$, starting from $\pi[0] = r$, stay in the 'active set'. Therefore

$$V_\omega(r) = \sum_{t=0}^{\infty} \beta^t R(Q^t(r)) = \sum_{t=0}^{\infty} \beta^t R\left(\frac{r - (p-r)^{t+1} r}{1+r-p}\right).$$

• When $\pi^*(\omega) \ge \pi^0$, since $\pi^0 > r$, the belief value $r$ is in the 'idle set'. From Lemma 3(i), the belief values $\pi[t]$, starting from $\pi[0] = r$, stay in the 'idle set'. Hence

$$V_\omega(r) = \omega + \beta\omega + \beta^2\omega + \cdots = \frac{\omega}{1-\beta}.$$

• When $r \le \pi^*(\omega) < \pi^0$, the belief value $r$ is in the 'idle set'. Since the channel is positively correlated, from Lemma 3, starting from $\pi[0] = r$, the user remains idle for a duration of $L(r, \pi^*(\omega))$ slots. Therefore

$$V_\omega(r) = \frac{1 - \beta^{L(r,\pi^*(\omega))}}{1-\beta}\,\omega + \beta^{L(r,\pi^*(\omega))}\, V^1_\omega\big(Q^{L(r,\pi^*(\omega))}(r)\big), \qquad (13)$$

where

$$V^1_\omega\big(Q^{L(r,\pi^*(\omega))}(r)\big) = R\big(Q^{L(r,\pi^*(\omega))}(r)\big) + \beta\Big(Q^{L(r,\pi^*(\omega))}(r)\, V_\omega(p) + \big(1 - Q^{L(r,\pi^*(\omega))}(r)\big) V_\omega(r)\Big).$$

Substituting the above expression in (13), we can obtain $V_\omega(r) = \Theta$ as given in expression (11) in the lemma.

(2) Negative correlation ($p < r$).

The derivation of $V_\omega(p)$ and $V_\omega(r)$ for the negative correlation case follows an approach similar to that for the case of positive correlation. Details are, therefore, omitted here. ■

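Note that the expressions of the value functions $V_\omega(p)$ and $V_\omega(r)$ in Lemma 2 are not in closed form. However, the closed form expressions for $V_\omega(p)$ and $V_\omega(r)$ can be easily calculated based on the expressions in Lemma 2, and are recorded below.

Case (1) Positive correlation ($p > r$). First we give the closed form expression of $V_\omega(p)$.

• If $\pi^*(\omega) < \pi^0$,

$$V_\omega(p) = \sum_{t=0}^{\infty} \beta^t R\left(\frac{r + (p-r)^{t+1}(1-p)}{1+r-p}\right).$$

• If $\pi^0 \le \pi^*(\omega) < p$,

$$V_\omega(p) = \frac{\beta(1-p)\omega + (1-\beta)R(p)}{(1-\beta)(1-\beta p)}.$$

• If $\pi^*(\omega) \ge p$, $V_\omega(p) = \omega/(1-\beta)$.

We proceed to give the closed form expression of $V_\omega(r)$.

• If $\pi^*(\omega) < r$,

$$V_\omega(r) = \sum_{t=0}^{\infty} \beta^t R\left(\frac{r - (p-r)^{t+1} r}{1+r-p}\right).$$

• If $r \le \pi^*(\omega) < \pi^0$, $V_\omega(r)$ is given in equation (14).

• If $\pi^*(\omega) \ge \pi^0$, $V_\omega(r) = \omega/(1-\beta)$.

Case (2) Negative correlation ($p < r$). In this case, the closed form expression of $V_\omega(r)$ is given as follows.

• If $\pi^*(\omega) \ge r$, then $V_\omega(r) = \omega/(1-\beta)$.

• If $Q(p) \le \pi^*(\omega) < r$, then

$$V_\omega(r) = \frac{\beta r \omega + (1-\beta)R(r)}{(1-\beta)\big(1-\beta(1-r)\big)}.$$

• If $p \le \pi^*(\omega) < Q(p)$, we have

$$V_\omega(r) = \frac{\beta r \omega + \beta^2 r R(Q(p)) + \big(1-\beta^2 Q(p)\big)R(r)}{\big(1-\beta(1-r)\big)\big(1-\beta^2 Q(p)\big) - \beta^3 r\big(1-Q(p)\big)}.$$

• If $\pi^*(\omega) < p$, then

$$V_\omega(r) = \sum_{t=0}^{\infty} \beta^t R\left(\frac{r - (p-r)^{t+1} r}{1+r-p}\right).$$

Then we give the closed form expression of $V_\omega(p)$.

• If $\pi^*(\omega) < p$, then

$$V_\omega(p) = \sum_{t=0}^{\infty} \beta^t R\left(\frac{r + (p-r)^{t+1}(1-p)}{1+r-p}\right).$$

• If $p \le \pi^*(\omega) < Q(p)$, we have

$$V_\omega(p) = \frac{\big(1-\beta(1-r)\big)\big[\omega + \beta R(Q(p))\big] + \beta^2\big(1-Q(p)\big)R(r)}{\big(1-\beta(1-r)\big)\big(1-\beta^2 Q(p)\big) - \beta^3 r\big(1-Q(p)\big)}.$$

• If $\pi^*(\omega) \ge Q(p)$, then $V_\omega(p) = \omega/(1-\beta)$.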
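The closed form for the positively correlated case with $r \le \pi^*(\omega) < \pi^0$ can be cross-checked against the two relations it is derived from: the Bellman relation for $V_\omega(p)$ with $p$ active, and relation (13) for $V_\omega(r)$. The sketch below solves that pair by fixed-point iteration and compares the result with the closed form. The idle duration `L` is treated as a given integer rather than computed from a threshold, and the reward `R_demo` and all parameter values are hypothetical.

```python
def Q_t(pi, p, r, t):
    """Closed form (12) for the t-step idle belief update."""
    return (r - (p - r) ** t * (r - (1.0 + r - p) * pi)) / (1.0 + r - p)


def Vr_closed_form(omega, p, r, beta, L, R):
    """Closed-form V_omega(r) when the user idles for L slots starting from belief r."""
    q = Q_t(r, p, r, L)
    num = ((1 - beta ** L) * (1 - beta * p) * omega
           + (1 - beta) * beta ** L * ((1 - beta * p) * R(q) + beta * q * R(p)))
    den = ((1 - beta) * (1 - beta * p) * (1 - beta ** (L + 1))
           + (1 - beta) ** 2 * q * beta ** (L + 1))
    return num / den


def Vr_fixed_point(omega, p, r, beta, L, R, iters=2000):
    """Solve the coupled relations for (V(p), V(r)) by fixed-point iteration."""
    q = Q_t(r, p, r, L)
    Vp = Vr = 0.0
    for _ in range(iters):
        Vp_new = R(p) + beta * (p * Vp + (1 - p) * Vr)
        Vr_new = ((1 - beta ** L) / (1 - beta) * omega
                  + beta ** L * (R(q) + beta * (q * Vp + (1 - q) * Vr)))
        Vp, Vr = Vp_new, Vr_new
    return Vr


def R_demo(pi):
    """Hypothetical convex increasing reward with R(0) = 0.2 and R(1) = 1."""
    return 0.2 + 0.8 * pi ** 2


args = dict(omega=0.5, p=0.8, r=0.2, beta=0.8, L=3, R=R_demo)
print(Vr_closed_form(**args), Vr_fixed_point(**args))
```

Agreement of the two numbers is a check on the algebra only; it does not verify that `L = 3` is the idle duration induced by an actual threshold.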

Appendix B 
Proof of Proposition 3

We prove that the problem is Whittle indexable by showing that $\pi^*(\omega)$ monotonically increases with $\omega$. It is clear from Proposition 2 that $\pi^*(\omega) = \kappa$ for $\omega \in [0, \delta)$. So it suffices to show that $\pi^*(\omega)$ is strictly increasing for $\omega \in [\delta, 1]$. The proof technique follows along the lines of [18] and is presented next. We first proceed with the following lemma, where the right derivatives of the reward functions are compared.

Lemma 4. If for all $\omega \in [\delta, 1]$, we have

$$\left.\frac{dV^1_\omega(\pi)}{d\omega}\right|_{\pi=\pi^*(\omega)} < \left.\frac{dV^0_\omega(\pi)}{d\omega}\right|_{\pi=\pi^*(\omega)}, \qquad (15)$$

then $\pi^*(\omega)$ is strictly increasing with $\omega$ for $\omega \in [\delta, 1]$.



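Proof: The lemma is proven by contradiction. Suppose there exists $\omega_0 \in [\delta, 1]$ such that $\pi^*(\omega)$ is decreasing (i.e., non-increasing) at $\omega_0$; hence it is decreasing in a neighborhood of $\omega_0$, say $[\omega_0, \omega_0 + \Delta\omega]$. Since $V^1_{\omega_0+\Delta\omega}(\pi^*(\omega_0+\Delta\omega)) = V^0_{\omega_0+\Delta\omega}(\pi^*(\omega_0+\Delta\omega))$ and $\pi^*(\omega)$ is decreasing at $\omega_0$, $\pi^*(\omega_0)$ is within the 'active set' for the $(\omega_0+\Delta\omega)$-subsidy problem. Therefore we have $V^1_{\omega_0+\Delta\omega}(\pi^*(\omega_0)) \ge V^0_{\omega_0+\Delta\omega}(\pi^*(\omega_0))$. Besides, from the definition of the threshold value $\pi^*(\omega_0)$, $V^1_{\omega_0}(\pi^*(\omega_0)) = V^0_{\omega_0}(\pi^*(\omega_0))$. Therefore,

$$\left.\frac{dV^1_\omega(\pi)}{d\omega}\right|_{\pi=\pi^*(\omega_0)} = \lim_{\Delta\omega \to 0} \frac{V^1_{\omega_0+\Delta\omega}(\pi^*(\omega_0)) - V^1_{\omega_0}(\pi^*(\omega_0))}{\Delta\omega} \ge \lim_{\Delta\omega \to 0} \frac{V^0_{\omega_0+\Delta\omega}(\pi^*(\omega_0)) - V^0_{\omega_0}(\pi^*(\omega_0))}{\Delta\omega} = \left.\frac{dV^0_\omega(\pi)}{d\omega}\right|_{\pi=\pi^*(\omega_0)},$$

which contradicts the assumption.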



Therefore, to establish indexability, it suffices to prove the inequality (15). Let $D_\omega(\pi)$ be the discounted time the $\omega$-subsidy process, with initial belief $\pi$, is made passive, i.e.,

$$D_\omega(\pi) = \sum_{t=0}^{\infty} \beta^t\, \mathbb{1}(a[t] = 0).$$

Noting that, given the value of $\pi^*(\omega)$, the belief value evolution follows the same pattern as in the ON/OFF channel case, the expression of $D_\omega(\pi)$ takes the same form as given in [18]. It follows from [4], [18] that $D_\omega(\pi) = \frac{dV_\omega(\pi)}{d\omega}$. Taking the derivative of the $V^1_\omega(\pi)$ and $V^0_\omega(\pi)$ expressions (8)-(9) with respect to $\omega$, the objective (15) now becomes

$$\beta\big(\pi^*(\omega) D_\omega(p) + (1-\pi^*(\omega)) D_\omega(r)\big) < 1 + \beta D_\omega\big(Q(\pi^*(\omega))\big). \qquad (16)$$

$$V_\omega(r) = \frac{\big(1-\beta^{L(r,\pi^*(\omega))}\big)(1-\beta p)\,\omega + (1-\beta)\,\beta^{L(r,\pi^*(\omega))}\Big[(1-\beta p) R\big(Q^{L(r,\pi^*(\omega))}(r)\big) + \beta\, Q^{L(r,\pi^*(\omega))}(r)\, R(p)\Big]}{(1-\beta)(1-\beta p)\big(1-\beta^{L(r,\pi^*(\omega))+1}\big) + (1-\beta)^2\, Q^{L(r,\pi^*(\omega))}(r)\, \beta^{L(r,\pi^*(\omega))+1}} \qquad (14)$$

Case (1) If $0 \le \pi^*(\omega) < \min\{p, r\}$: from Lemma 3(ii), starting from the initial belief value $\pi[0] = r$ or $\pi[0] = p$, the belief value $\pi[t]$ never evolves below $\pi^*(\omega)$, hence the project is active at all times under optimal control. Therefore $D_\omega(p) = D_\omega(r) = D_\omega(Q(\pi^*(\omega))) = 0$. Inequality (16) thus holds.

Case (2) If $\pi^0 \le \pi^*(\omega) \le 1$: starting from the initial belief $\pi[0] = Q(\pi^*(\omega))$, the belief value $\pi[t]$ always stays within the 'idle set', i.e., $D_\omega(Q(\pi^*(\omega))) = \frac{1}{1-\beta}$. Inequality (16) holds since $D_\omega(p) \le 1 + \beta + \beta^2 + \cdots = \frac{1}{1-\beta}$ and, similarly, $D_\omega(r) \le \frac{1}{1-\beta}$.

Case (3) If $\min\{p, r\} \le \pi^*(\omega) < \pi^0$: from Lemma 3(ii), $Q(\pi^*(\omega))$ is in the 'active set'. Since

$$V_\omega\big(Q(\pi^*(\omega))\big) = R\big(Q(\pi^*(\omega))\big) + \beta\big[Q(\pi^*(\omega)) V_\omega(p) + (1 - Q(\pi^*(\omega))) V_\omega(r)\big],$$

we have

$$D_\omega\big(Q(\pi^*(\omega))\big) = \beta\big[Q(\pi^*(\omega)) D_\omega(p) + (1 - Q(\pi^*(\omega))) D_\omega(r)\big]. \qquad (17)$$

We then discuss (17) separately for negatively and positively correlated channels.

• Negatively correlated channel ($p < r$). Since $r > \pi^0 > \pi^*(\omega)$, the belief value $r$ is in the 'active set', hence

$$V_\omega(r) = R(r) + \beta\big(r V_\omega(p) + (1-r) V_\omega(r)\big).$$

Therefore, we have

$$D_\omega(r) = \beta\big(r D_\omega(p) + (1-r) D_\omega(r)\big). \qquad (18)$$

Substituting equations (17) and (18) in (16), we get

$$\frac{\beta}{1-\beta(1-r)}\, D_\omega(p)\,(1-\beta)\big(\beta r + \pi^*(\omega) - \beta Q(\pi^*(\omega))\big) < 1.$$

Following the same technique as in [18], the above inequality can be verified by substituting $\pi^*(\omega)$ by $\pi^0$ and $D_\omega(p)$ by $\frac{1}{1-\beta}$.

• Positively correlated channel ($p > r$). In this case, $p$ is in the 'active set', hence

$$V_\omega(p) = R(p) + \beta\big(p V_\omega(p) + (1-p) V_\omega(r)\big).$$

Taking the derivative with respect to $\omega$, we have

$$D_\omega(p) = \beta\big(p D_\omega(p) + (1-p) D_\omega(r)\big). \qquad (19)$$

Substituting equations (19) and (17) in (16), we have

$$\frac{\beta}{1-\beta p}\, D_\omega(r)\,(1-\beta)\big(1 - \beta p - \pi^*(\omega) + \beta Q(\pi^*(\omega))\big) < 1.$$

We note that the expression of $D_\omega(r)$ takes the same form as in [18]. By applying the same technique as in [18], it can be checked that the above inequality indeed holds.

Therefore the inequality (16) is justified and hence indexability holds. ■



Appendix C 
Proof of Proposition 4

For the $\omega$-subsidy problem of user $i$, from indexability, we know that $\pi^*(\omega)$ strictly increases from 0 to 1 as $\omega$ increases from $\delta_i$ to 1. Hence the index value, from its definition in (10), is the subsidy value for which the active and idle decisions are equally attractive. We can hence derive the index value $W_i(\pi_i)$ by equating $V^1_{i,\omega}(\pi_i)$ and $V^0_{i,\omega}(\pi_i)$ and solving for $\omega$ as a function of $\pi_i$, i.e.,

$$W_i(\pi_i) + \beta V_{i,W_i(\pi_i)}(Q_i(\pi_i)) = R_i(\pi_i) + \beta\big[\pi_i V_{i,W_i(\pi_i)}(p_i) + (1-\pi_i) V_{i,W_i(\pi_i)}(r_i)\big]. \qquad (20)$$

Note that the expressions of $V_{i,\omega}(p_i)$ and $V_{i,\omega}(r_i)$ have been given by Lemma 2. Substituting in (20) the values of $V_{i,\omega}(p_i)$ and $V_{i,\omega}(r_i)$, we obtain the index value expressions, explained in the following.

Case (1). Positive correlation ($p_i > r_i$).

• If $\pi_i \ge p_i$, the belief values $Q_i(\pi_i)$, $p_i$, $r_i$ are in the 'idle set' and, starting from the initial belief $\pi_i[0] = Q_i(\pi_i)$, $\pi_i[0] = p_i$, or $\pi_i[0] = r_i$, $\pi_i[t]$ will stay in the 'idle set'. Hence

$$V_{i,\omega}(Q_i(\pi_i)) = V_{i,\omega}(p_i) = V_{i,\omega}(r_i) = \frac{\omega}{1-\beta}.$$

Substituting the above expressions in (20), we obtain that $W_i(\pi_i) = R_i(\pi_i)$.

• If $\pi_i^0 \le \pi_i < p_i$, then $p_i$ is in the 'active set', and starting from the initial belief $\pi_i[0] = r_i$ or $\pi_i[0] = Q_i(\pi_i)$, $\pi_i[t]$ stays within the 'idle set' at all times. Hence

$$V_{i,\omega}(Q_i(\pi_i)) = V_{i,\omega}(r_i) = \frac{\omega}{1-\beta}.$$

Substituting the above expressions and the expression of $V_{i,\omega}(p_i)$ (given in Lemma 2) in (20), we get

$$W_i(\pi_i) = \frac{\beta \pi_i R_i(p_i) + (1-\beta p_i) R_i(\pi_i)}{1 + \beta(\pi_i - p_i)}.$$

• If $\pi_i < \pi_i^0$, then the value $Q_i(\pi_i)$ is in the 'active set'. Therefore,

$$V_{i,\omega}(Q_i(\pi_i)) = R_i(Q_i(\pi_i)) + \beta\big[Q_i(\pi_i) V_{i,\omega}(p_i) + (1 - Q_i(\pi_i)) V_{i,\omega}(r_i)\big].$$

Again, substituting the expression of $V_{i,\omega}(Q_i(\pi_i))$ in (20), we have

$$W_i(\pi_i) = \big[R_i(\pi_i) - \beta R_i(Q_i(\pi_i))\big] + \beta\big[\pi_i - \beta Q_i(\pi_i)\big] V_{i,W_i(\pi_i)}(p_i) + \beta\big[(1-\pi_i) - \beta(1 - Q_i(\pi_i))\big] V_{i,W_i(\pi_i)}(r_i).$$

Case (2). Negative correlation ($r_i > p_i$).

Using an approach similar to that of the positive correlation case, the expressions of the index value can be derived for the case of negative correlation; they are given in Proposition 4. Details are, therefore, omitted here. ■
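The middle case of the positively correlated index can be verified against the indifference condition it comes from. The sketch below evaluates $W = [\beta\pi R(p) + (1-\beta p)R(\pi)] / [1 + \beta(\pi - p)]$ and checks that, with $V(r) = V(Q(\pi)) = W/(1-\beta)$ and $V(p) = [R(p) + \beta(1-p)V(r)]/(1-\beta p)$, the idle and active values coincide. The reward `R_demo` and the parameter values are hypothetical.

```python
def index_mid_positive(pi, p, beta, R):
    """Index value for pi0 <= pi < p in the positively correlated case."""
    return (beta * pi * R(p) + (1 - beta * p) * R(pi)) / (1 + beta * (pi - p))


def indifference_gap(pi, p, beta, R):
    """Idle value minus active value at subsidy W(pi); should be zero."""
    w = index_mid_positive(pi, p, beta, R)
    v_idle_forever = w / (1 - beta)              # V(r) = V(Q(pi)) in this regime
    v_p = (R(p) + beta * (1 - p) * v_idle_forever) / (1 - beta * p)
    idle = w + beta * v_idle_forever
    active = R(pi) + beta * (pi * v_p + (1 - pi) * v_idle_forever)
    return idle - active


def R_demo(pi):
    """Hypothetical convex increasing reward with R(0) = 0.2 and R(1) = 1."""
    return 0.2 + 0.8 * pi ** 2


print(index_mid_positive(0.6, 0.8, 0.8, R_demo))
print(indifference_gap(0.6, 0.8, 0.8, R_demo))
```

At $\pi = p$ the expression reduces to $R(p)$, which matches the first case and shows the index is continuous across that boundary.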

Note that the expressions given in Proposition 4 are not in closed form. However, the closed form expression for the index value $W_i(\pi_i)$ can be easily calculated based on the expressions provided in Proposition 4, and is recorded as follows.

Case (1). Positively correlated channel ($p_i > r_i$).

• If $\pi_i \ge p_i$, then the index value $W_i(\pi_i) = R_i(\pi_i)$.

• If $\pi_i^0 \le \pi_i < p_i$, then

$$W_i(\pi_i) = \frac{\beta \pi_i R_i(p_i) + (1-\beta p_i) R_i(\pi_i)}{1 + \beta(\pi_i - p_i)}.$$

• If $r_i \le \pi_i < \pi_i^0$, $W_i(\pi_i)$ is given in (21), where

$$\Gamma_i = (1-\beta)(1-\beta p_i)\big(1-\beta^{L(r_i,\pi^*(\omega))+1}\big) + (1-\beta)^2\, Q_i^{L(r_i,\pi^*(\omega))}(r_i)\, \beta^{L(r_i,\pi^*(\omega))+1},$$

$$\Lambda_i = (1-\beta p_i) R_i\big(Q_i^{L(r_i,\pi^*(\omega))}(r_i)\big) + \beta\, Q_i^{L(r_i,\pi^*(\omega))}(r_i)\, R_i(p_i).$$

• If $\pi_i < r_i$, the index value $W_i(\pi_i)$ is given in (22).

Case (2). Negatively correlated channel ($p_i < r_i$).

• If $\pi_i \ge r_i$, we have $W_i(\pi_i) = R_i(\pi_i)$.

• If $Q_i(p_i) \le \pi_i < r_i$, then

$$W_i(\pi_i) = \frac{(1-\beta)\big[(1-\beta(1-r_i)) R_i(\pi_i) + \beta(1-\pi_i) R_i(r_i)\big]}{[1-\beta\pi_i][1-\beta(1-r_i)] - \beta^2(1-\pi_i) r_i}.$$

• If $\pi_i^0 \le \pi_i < Q_i(p_i)$, the index value is expressed as

$$W_i(\pi_i) = \frac{(1-\beta) R_i(\pi_i) \Delta_i + \beta(1-\beta)\pi_i \Omega_i + \beta(1-\beta)(1-\pi_i) T_i}{\Delta_i - \beta(1-\beta)\big(1-\beta(1-r_i)\big)\pi_i - (1-\beta)\beta^2 r_i (1-\pi_i)},$$

where

$$\Delta_i = \big(1-\beta(1-r_i)\big)\big(1-\beta^2 Q_i(p_i)\big) - \beta^3 r_i \big(1-Q_i(p_i)\big), \qquad (23)$$

$$\Omega_i = \beta\big(1-\beta(1-r_i)\big) R_i(Q_i(p_i)) + \beta^2\big(1-Q_i(p_i)\big) R_i(r_i), \qquad (24)$$

$$T_i = \beta^2 r_i R_i(Q_i(p_i)) + \big(1-\beta^2 Q_i(p_i)\big) R_i(r_i). \qquad (25)$$

• If $p_i \le \pi_i < \pi_i^0$, $W_i(\pi_i)$ is given in (26), where $\Delta_i$, $\Omega_i$ and $T_i$ are given by (23)-(25), respectively.

• If $\pi_i < p_i$, the index value $W_i(\pi_i)$ is given in equation (27).